Skip to content

GH-15483: [C++] Add a Fixed Shape Tensor canonical ExtensionType#8510

Merged
jorisvandenbossche merged 39 commits intoapache:mainfrom
rok:ARROW-1614
Apr 4, 2023
Merged

GH-15483: [C++] Add a Fixed Shape Tensor canonical ExtensionType#8510
jorisvandenbossche merged 39 commits intoapache:mainfrom
rok:ARROW-1614

Conversation

@rok
Copy link
Member

@rok rok commented Oct 22, 2020

ARROW-1614: In an Arrow table, we would like to add support for a column that has values cells each containing a tensor value, with all tensors having the same dimensions. These would be stored as a binary value, plus some metadata to store type and shape/strides.

@github-actions
Copy link

@rok rok marked this pull request as draft October 22, 2020 16:54
@jorisvandenbossche
Copy link
Member

Currently, only the shape is stored. Is this enough? That does a assume a fixed row major order?

@rok
Copy link
Member Author

rok commented Oct 30, 2020

Currently, only the shape is stored. Is this enough? That does a assume a fixed row major order?

I think we either assume that or also store strides / dimension order. I am not sure how dimension order changes are done in other frameworks (TF, pytorch, etc.) but I would assume they don't reorder tensors in memory. So I would go for storing strides.

@sjperkins
Copy link
Contributor

In the context of testing metadata equality withinin multiple parquet files in a dataset, equality on shape and strides may be a very strict requirement. Would relaxing the equality requirement to only compare the number of tensor dimensions negatively impact the design?

@rok
Copy link
Member Author

rok commented Oct 5, 2021

In the context of testing metadata equality withinin multiple parquet files in a dataset, equality on shape and strides may be a very strict requirement. Would relaxing the equality requirement to only compare the number of tensor dimensions negatively impact the design?

Good point. By tensor dimensions you mean shape, right?
I think it's ok to relax on the strides check. I've pushed a change, see latest commit.

@rok rok marked this pull request as ready for review October 5, 2021 09:36
@sjperkins
Copy link
Contributor

In the context of testing metadata equality withinin multiple parquet files in a dataset, equality on shape and strides may be a very strict requirement. Would relaxing the equality requirement to only compare the number of tensor dimensions negatively impact the design?

Good point. By tensor dimensions you mean shape, right? I think it's ok to relax on the strides check. I've pushed a change, see latest commit.

I was thinking even looser:

def __eq__(self, other):
    len(self.shape) == len(other.shape)

@rok
Copy link
Member Author

rok commented Oct 5, 2021

I was thinking even looser:

def __eq__(self, other):
    len(self.shape) == len(other.shape)

Done.
We could introduce comparison options in case there would be differing requirements here.

@rok
Copy link
Member Author

rok commented Dec 10, 2021

@jorisvandenbossche @sjperkins @pitrou is there interest to get this in?
If yes is cpp/src/arrow/extension_type_test.cc a good place to put it?

@pitrou
Copy link
Member

pitrou commented Dec 13, 2021

Currently we don't ship any standard extension types. I recommend discussing this on the mailing-list.

@Hoeze
Copy link

Hoeze commented Jan 19, 2022

fyi, the ray project created its own Tensor type:
https://docs.ray.io/en/latest/_modules/ray/data/extensions/tensor_extension.html#ArrowTensorArray

@wesm
Copy link
Member

wesm commented Jan 19, 2022

Indeed I think having a built-in Tensor value type (implemented using extension arrays) in Arrow/pyarrow would be better than having third party projects rolling their own.

@frreiss
Copy link

frreiss commented Jan 24, 2022

@wesm would there be interest in folding the Pandas side of these third-party extensions into Pandas also?

@jorisvandenbossche
Copy link
Member

would there be interest in folding the Pandas side of these third-party extensions into Pandas also?

That will be something to discuss in the pandas project.
(speaking as a pandas maintainer, for now we mostly encourage creating third-party extensions (that was the whole purpose of formalizing this ExtensionArray interface in pandas), but at some point we should also expand the types in pandas itself. Although I personally think we should rather start with adding a simple List type, than directly a tensor type)

@rok
Copy link
Member Author

rok commented Mar 30, 2023

Thanks for the fast review @jorisvandenbossche. I've addressed your points.

@rok
Copy link
Member Author

rok commented Mar 31, 2023

@jorisvandenbossche it seems CI is satisfied.

@AlenkaF
Copy link
Member

AlenkaF commented Apr 4, 2023

@rok would it be possible to add a method for value_type similar to shape(), permutation() and dim_names()? I think I can not currently retrieve it in Python.

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have connected it to Python and it works well 👍
Would be great to have a method for value_type though, as mentioned in a previous comment.

@rok
Copy link
Member Author

rok commented Apr 4, 2023

@AlenkaF sure, I'll add it.

@rok
Copy link
Member Author

rok commented Apr 4, 2023

@AlenkaF done.

@AlenkaF
Copy link
Member

AlenkaF commented Apr 4, 2023

@rok, you are awesome! 👍

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failing CI is unrelated? (it seems the R failures are being worked on, and the C++ failures are related to LLVM update #34768)

@Batash0
Copy link

Batash0 commented Apr 4, 2023

Great news!

rok and others added 2 commits April 4, 2023 12:51
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@rok
Copy link
Member Author

rok commented Apr 4, 2023

The failing CI is unrelated? (it seems the R failures are being worked on, and the C++ failures are related to LLVM update #34768)

They seem unrelated indeed and I don't think they obscure any new problems as the change was fairly minimal.

@jorisvandenbossche
Copy link
Member

Merged after 2.5 years ;) Thanks @rok!

@rok
Copy link
Member Author

rok commented Apr 4, 2023

Thanks for all the input and reviews everyone, very happy to see this merged!

@jorisvandenbossche now let's talk about strides @ #34797 :D

@ursabot
Copy link

ursabot commented Apr 5, 2023

Benchmark runs are scheduled for baseline = 81c828e and contender = a84a39b. a84a39b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.56% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Failed ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] a84a39b6 ec2-t3-xlarge-us-east-2
[Failed] a84a39b6 test-mac-arm
[Finished] a84a39b6 ursa-i9-9960x
[Failed] a84a39b6 ursa-thinkcentre-m75q
[Finished] 81c828ed ec2-t3-xlarge-us-east-2
[Failed] 81c828ed test-mac-arm
[Finished] 81c828ed ursa-i9-9960x
[Failed] 81c828ed ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@rok
Copy link
Member Author

rok commented Aug 17, 2023

We started a mailing list discussion about potential VariableShapeTensor extension array, please check it out and give feedback. For more details here's also a PR #37166.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++] Add a Tensor logical value type with constant shape, implemented using ExtensionType